
5 summarise clusters #6
Merged: 29 commits into dev from 5-summarise-clusters on Nov 12, 2024
Conversation

@RFOxbury (Contributor) commented Oct 8, 2024

Fixes issues #5 (topic summarisation) and #7 (semantic chunking)

To review:

  • Run python dsp_interview_transcripts/pipeline/run_pipeline.py, or follow the instructions in this README. Check that everything runs and that the outputs are as expected.

There is also a notebook on semantic chunking, but only look at that if you are interested - we're not taking that route for now.

@RFOxbury RFOxbury linked an issue Oct 8, 2024 that may be closed by this pull request
@RFOxbury RFOxbury marked this pull request as ready for review November 7, 2024 15:12
@RFOxbury RFOxbury requested a review from beingkk November 7, 2024 15:12
@beingkk (Contributor) left a comment:

Thank you Rosie, I've looked through most of the code and test ran it - it's all good!

I've made a few small comments (in addition to the ones I sent over Slack) - take a look and see if you'd like to address them now.

Otherwise, happy for you to merge and continue with the evals and dashboard.

I also added a few tests that helped me better understand the data processing functions. Didn't have time to do "test-driven reviewing" for the rest of the code, however.

# Fix improperly coded characters
- data_df['text_clean'] = data_df['text'].apply(lambda x: ftfy.fix_text(x))
+ data_df["text_clean"] = data_df["text"].apply(lambda x: ftfy.fix_text(x))
Contributor:
Nice, I didn't know about ftfy.fix_text
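
For anyone else who hasn't used it, a minimal sketch of what it does (the example string is illustrative, not from our data):

    import ftfy

    # ftfy repairs mojibake - text that was decoded with the wrong encoding somewhere upstream
    broken = "The Mona Lisa doesnÃ¢â‚¬â„¢t have eyebrows."
    print(ftfy.fix_text(broken))
    # The Mona Lisa doesn't have eyebrows.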

def name_topics(
topic_info: pd.DataFrame,
llm_chain,
topics: List[str],
Contributor:

I guess this topics variable could be removed and you could move the line

 topics = topic_info["Cluster"].unique().tolist()

inside this function, to simplify the input variables.
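
Roughly something like this (just a sketch of the simplified signature; the rest of the function body would stay as it is):

    import pandas as pd

    def name_topics(
        topic_info: pd.DataFrame,
        llm_chain,
    ):
        # Derive the cluster labels inside the function rather than passing them in as an argument
        topics = topic_info["Cluster"].unique().tolist()
        ...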


if __name__ == "__main__":

errors = []
Contributor:
Are you doing anything with this list in this script?


def match_questions(bot_qs_list: List[str], questions: List[str], threshold: float = 0.85) -> pd.DataFrame:
"""Match the input questions from our interview template,
and the actual questions produced by the bot, using cosine similarity.
Contributor:

Suggested change:
- and the actual questions produced by the bot, using cosine similarity.
+ and the actual sentences produced by the bot, using cosine similarity.

I guess you're not just matching questions but in fact all sentences?

Contributor Author:

Hi, yes this is correct! Good spot
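
For reference, a rough sketch of this kind of cosine-similarity sentence matching with sentence-transformers (the model choice, function name and output columns here are assumptions for illustration, not necessarily what match_questions actually does):

    from typing import List

    import pandas as pd
    from sentence_transformers import SentenceTransformer, util

    def match_sentences(bot_sentences: List[str], questions: List[str], threshold: float = 0.85) -> pd.DataFrame:
        """Pair each bot sentence with its most similar template question, keeping pairs above the threshold."""
        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
        bot_emb = model.encode(bot_sentences, convert_to_tensor=True)
        q_emb = model.encode(questions, convert_to_tensor=True)
        sims = util.cos_sim(bot_emb, q_emb)  # shape: (n_bot_sentences, n_questions)
        rows = []
        for i, sentence in enumerate(bot_sentences):
            best = int(sims[i].argmax())
            score = float(sims[i][best])
            if score >= threshold:
                rows.append({"bot_sentence": sentence, "matched_question": questions[best], "similarity": score})
        return pd.DataFrame(rows)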

Comment on lines 254 to 255
interviews_cleaned_df = interviews_df.groupby("conversation").apply(remove_preamble).reset_index(drop=True)

Contributor:

It doesn't look like you're using or saving interviews_cleaned_df after this...

Contributor Author:

Also a good spot - thank you!!

dsp_interview_transcripts/pipeline/README.md (outdated comment, resolved)
dsp_interview_transcripts/pipeline/name_clusters.py (outdated comment, resolved)
Comment on lines +85 to +87
topic_counts = pd.DataFrame(data["Cluster"].value_counts()).reset_index()
topic_counts = topic_counts.rename(columns={"count": "N responses in topic"})

Contributor:

Minor comment, but I would like us to use pandas chaining as much as possible as it makes it easier to read the code.

    topic_counts = (
        pd.DataFrame(data["Cluster"].value_counts())
        .reset_index()
        .rename(columns={"count": "N responses in topic"})
    )

Comment on lines +88 to +91
data_w_names = data_w_names.rename(columns={"llama3.2_name": "Name", "llama3.2_description": "Description"})
data_w_names = pd.merge(data_w_names, topic_counts, left_on="Cluster", right_on="Cluster", how="left")

data_w_names[["Name", "Description", "Top Words", "N responses in topic"]].to_csv(OUTPUT_PATH_SUMMARY, index=False)
Contributor:

Same point here about chains
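
i.e. roughly:

    data_w_names = (
        data_w_names.rename(columns={"llama3.2_name": "Name", "llama3.2_description": "Description"})
        .merge(topic_counts, on="Cluster", how="left")
    )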

Comment on lines 97 to 100
data_viz = pd.merge(data, data_w_names[["Cluster", "Name", "Description"]], on="Cluster", how="left")

data_viz["Name"] = data_viz["Name"].fillna("None")
data_viz["Description"] = data_viz["Description"].fillna("None")
Contributor:

You can also use chains here and, for example, .assign(Name=lambda df: df['Name'].fillna("None")).
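
Something along these lines:

    data_viz = (
        data.merge(data_w_names[["Cluster", "Name", "Description"]], on="Cluster", how="left")
        .assign(
            Name=lambda df: df["Name"].fillna("None"),
            Description=lambda df: df["Description"].fillna("None"),
        )
    )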

@RFOxbury RFOxbury merged commit 88e233a into dev Nov 12, 2024
@RFOxbury RFOxbury deleted the 5-summarise-clusters branch November 12, 2024 13:54